Incorporating ENCODE information into association analysis of whole genome sequencing data
Abstract
With the rapidly decreasing cost of next-generation sequencing technology, a large number of whole genome sequences have been generated, enabling researchers to survey rare variants in both the protein-coding and regulatory regions of the genome. However, identifying functional variants associated with complex diseases from whole genome sequencing (WGS) data remains a daunting task, because millions of candidate variants must be tested with only moderate sample sizes. We propose incorporating Encyclopedia of DNA Elements (ENCODE) information into the association analysis of WGS data to boost statistical power. Specifically, we use RegulomeDB and PolyPhen2 scores as external weights in existing rare-variant association tests. We demonstrate the proposed framework using the WGS data and blood pressure phenotype from the San Antonio Family Studies, provided by Genetic Analysis Workshop 19. We identified a genome-wide significant locus in the gene SNUPN on chromosome 15 that harbors a rare nonsynonymous variant. This locus was not detected by benchmark methods that do not incorporate biological information, including the T5 burden test and the sequence kernel association test.
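The idea of using external functional scores as weights in a rare-variant test can be illustrated with a minimal sketch of a weighted burden test. This is not the authors' implementation: the abstract does not say how RegulomeDB or PolyPhen2 scores are converted to weights, so the weight values below are hypothetical, and the function names are made up for illustration. The sketch also treats samples as unrelated, which would not be appropriate for family data without further adjustment.

```python
import numpy as np


def weighted_burden_scores(genotypes, weights):
    """Collapse the rare variants of one region into a per-sample burden.

    genotypes: (n_samples, n_variants) minor-allele counts (0, 1 or 2)
    weights:   (n_variants,) non-negative external weights; in this sketch
               they stand in for values derived from functional scores
               (the mapping from scores to weights is an assumption)
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if genotypes.shape[1] != weights.shape[0]:
        raise ValueError("one weight per variant is required")
    # Each sample's burden is the weighted sum of its minor-allele counts.
    return genotypes @ weights


def burden_test_statistic(burden, phenotype):
    """t-statistic for the slope in phenotype ~ intercept + burden (OLS).

    Assumes unrelated samples; relatedness is ignored here.
    """
    burden = np.asarray(burden, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    n = burden.shape[0]
    x = burden - burden.mean()
    y = phenotype - phenotype.mean()
    sxx = x @ x
    if sxx == 0:
        raise ValueError("burden has no variation across samples")
    slope = (x @ y) / sxx
    resid = y - slope * x
    sigma2 = (resid @ resid) / (n - 2)  # residual variance estimate
    return slope / np.sqrt(sigma2 / sxx)


# Toy data: 4 samples, 2 rare variants, hypothetical weights.
toy_genotypes = [[0, 1], [1, 0], [0, 0], [1, 1]]
toy_weights = [0.5, 1.0]
toy_phenotype = [2.1, 0.9, 0.1, 3.0]

toy_burden = weighted_burden_scores(toy_genotypes, toy_weights)
toy_t = burden_test_statistic(toy_burden, toy_phenotype)
```

Setting all weights to 1 reduces this to an unweighted burden test, which is the kind of baseline that a weighted analysis would be compared against.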
Similar resources
Using linkage analysis of large pedigrees to guide association analyses
To date, genome-wide association studies have yielded discoveries of common variants that partly explain familial aggregation of diseases and traits. Researchers are now turning their attention to less common variants because the price of sequencing has dropped drastically. However, because sequencing of the whole genome in large samples is costly, great care must be taken to prioritize which s...
Extended T2 tests for longitudinal family data in whole genome sequencing studies
Family data and rare variants are two key features of whole genome sequencing analysis for hunting the missing heritability of common human diseases. Recently, Zhu and Xiong proposed the generalized T(2) tests that combine rare variant analysis and family data analysis. In similar fashion, we developed the extended T(2) tests for longitudinal whole genome sequencing data for family-based associ...
Value of Mendelian laws of segregation in families: data quality control, imputation, and beyond.
When analyzing family data, we dream of perfectly informative data, even whole-genome sequences (WGSs) for all family members. Reality intervenes, and we find that next-generation sequencing (NGS) data have errors and are often too expensive or impossible to collect on everyone. The Genetic Analysis Workshop 18 working groups on quality control and dropping WGSs through families using a genome-...
Genome-wide joint analysis of single-nucleotide variant sets and gene expression for hypertension and related phenotypes
BACKGROUND: With advances in next-generation sequencing technologies, the study of rare variants in targeted genome regions or even the whole genome has become feasible. Nevertheless, the massive amount of sequencing data brings great computational and statistical challenges for association analyses. Aside from sequencing variants, other high-throughput omic data (e.g., gene expression data) also ...
CWig: compressed representation of Wiggle/BedGraph format
MOTIVATION: BigWig, a format to represent read density data, is one of the most popular data types. It can represent the peak intensity in ChIP-seq, the transcript expression in RNA-seq, the copy number variation in whole genome sequencing, etc. The UCSC Encode project uses the bigWig format heavily for storage and visualization. Of the 5.2 TB Encode hg19 database, 1.6 TB (31% of the total space) is u...
Journal title:
Volume 10, issue
Pages -
Publication date: 2016